Inter-class relationships in text classification

Authors

  • Shantanu Godbole
  • Soumen Chakrabarti
Abstract

Text classification is an active research area motivated by many real-world applications. Even so, research formulations and prototypes often make assumptions that are not suitable for deployment. For example, in many real applications, the set of class labels keeps evolving, continual user feedback must be integrated into the classifier, and test documents may come from a population statistically different from the training distribution. The main aim of our work is to build solutions for these problems using the idea of exploiting inter-class relationships. We learn noisy, approximate, and probabilistic mappings between related classes across label-sets in a semi-supervised framework we call cross-training. We exploit the notion of confusion between closely related classes, study its effect on label hierarchies, and present an algorithm for scaling up training of multi-class classifiers. We design discriminative, multi-label classifiers that are robust in the face of significant overlap, in terms of word distributions, between related classes. In many real applications, the set of labels is not predefined but must be constructed from vague specifications and a study of the corpus. Moreover, the label-set has to keep evolving as the corpus changes. We propose an algorithm that supports such temporal evolution by detecting classes in unseen data not defined during training. Our algorithm detects such classes using new notions of coverage of label-sets, support and confidence in a classification setting, and abstractions to represent documents. To enable continual interactive learning and to incorporate human input, we present a framework for active learning that combines terms and documents in a symmetric manner, reducing cognitive burden on the trainer. We conclude by proposing a new architecture for next-generation text classification platforms that embodies the ideas and contributions in this dissertation. 
To summarize, our work fills in conspicuous gaps between research prototypes and industry requirements, by exploiting one central idea: class labels are mutable variates just like words, documents and their assigned labels.


Similar resources

Exploiting Associations between Class Labels in Multi-label Classification

Multi-label classification has many applications in text categorization, biology, and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As it is often the case that there are relationships between the labels, extracting the existing relationships between the labels and taking advantage of them during the training or prediction phases ...

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and might even treat minority-class data as outliers. The text data is one of t...

Information Gain Feature Selection for Ordinal Text Classification using Probability Re-distribution

This paper looks at feature selection for ordinal text classification. Typical applications are sentiment and opinion classification, where classes have relationships based on an ordinal scale. We show that standard feature selection using Information Gain (IG) fails to identify discriminatory features, particularly when they are distributed over multiple ordinal classes. This is because inter-...

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting is to make searchable unindexed image documents by locating word/words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...

A Text Watermarking Algorithm based on Word Classification and Inter-word Space Statistics

Text documents can be watermarked by patterning the inter-word spaces. This paper proposes a text watermarking algorithm that exploits the novel concepts of word classification and inter-word space statistics. The words are classified using some features. Several adjacent words are grouped into a segment, and the segments are also classified using the word class information. The same amount of ...


Publication date: 2006